StackOverflow is a question-and-answer website for programmers. Each question can have up to 5 descriptive tags so that people can find related questions. In this demo, we will look at 50,000 questions with at least 3 tags to try to visualize what StackOverflow is all about. The data came from http://data.stackexchange.com/stackoverflow/query/new with a query like this:
select tags from posts where PostTypeId=1 and tags like '%><%><%'
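To see why the pattern '%><%><%' selects questions with at least three tags, here is a minimal Python sketch. It assumes tags are stored as a string of the form <tag1><tag2><tag3>, which is what the pattern suggests; the helper names count_tags and matches_like_pattern are hypothetical and only illustrate the idea.

```python
import re

def count_tags(tag_string):
    """Count tags in a string of the form '<a><b><c>' (assumed storage format)."""
    return len(re.findall(r"<([^<>]+)>", tag_string))

def matches_like_pattern(tag_string):
    """Emulate SQL `tags LIKE '%><%><%'`: two non-overlapping '><' boundaries."""
    first = tag_string.find("><")
    return first != -1 and tag_string.find("><", first + 2) != -1

samples = ["<python>", "<c#><.net>", "<c#><.net><linq>", "<a><b><c><d>"]
for s in samples:
    # Print the tag string, its tag count, and whether the pattern would match.
    print(s, count_tags(s), matches_like_pattern(s))
```

Two '><' boundaries can only occur when there are three or more tags, so one- and two-tag questions are filtered out.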
The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.
In [1]:
from pymldb import Connection
mldb = Connection()
The sequence of procedures below is based on the one explained in the Mapping Reddit demo notebook. In pseudo-code:
dataset so_raw = import("http://public.mldb.ai/stackoverflow.csv.gz")
dataset so_tags = tokenize(so_raw)
dataset so_counts = count(transpose(so_tags))
dataset so_svd_embedding = svd_embed(so_tags)
dataset so_kmeans_clusters = kmeans_cluster(so_svd_embedding)
dataset so_tsne_embedding = tsne_embed(so_svd_embedding)
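To make the tokenize step concrete, here is a rough plain-Python approximation of what it does to one raw line. It assumes each line looks like "<c#><.net><linq>" and that empty tokens are dropped; it is a sketch of the idea, not MLDB's implementation, and tokenize_line is a hypothetical helper.

```python
import re
from collections import Counter

# Same characters as the splitChars option passed to tokenize in this notebook.
SPLIT_CHARS = '"\r<>'

def tokenize_line(line_text):
    """Split a raw line on any split character and count the non-empty tokens."""
    pattern = "[" + re.escape(SPLIT_CHARS) + "]+"
    tokens = [t for t in re.split(pattern, line_text) if t]
    return Counter(tokens)

# One question's tags become a sparse row: one column per tag, with its count.
print(tokenize_line('"<c#><.net><linq>"\r'))
```

Each question thus becomes a sparse row with one column per tag, which is the shape the later steps work on.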
In [2]:
# dataset so_raw = import("http://public.mldb.ai/stackoverflow.csv.gz")
# Empty delimiter and quoteChar: each line is imported whole into the lineText column.
print(mldb.put('/v1/procedures/import_so_raw', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "http://public.mldb.ai/stackoverflow.csv.gz",
        'delimiter': '',
        'quoteChar': '',
        'outputDataset': 'so_raw',
        'runOnCreation': True
    }
}))
# dataset so_tags = tokenize(so_raw)
# Split each raw line on quotes, carriage returns and angle brackets: one column per tag.
print(mldb.post('/v1/procedures', {
    "id": "so_import",
    "type": "transform",
    "params": {
        "inputData": "select tokenize(lineText, {splitChars: '\"\r<>', quoteChar: ''}) as * from so_raw",
        "outputDataset": "so_tags",
        "runOnCreation": True
    }
}))
# dataset so_counts = count(transpose(so_tags))
# After transposing, each row is a tag and its column count is the number of questions.
print(mldb.post('/v1/procedures', {
    "id": "so_counts",
    "type": "transform",
    "params": {
        "inputData": "select columnCount() AS numQuestions from transpose(so_tags)",
        "outputDataset": "so_counts",
        "runOnCreation": True
    }
}))
# dataset so_svd_embedding = svd_embed(so_tags)
# Train on the 6000 most frequent tags only.
print(mldb.post('/v1/procedures', {
    "id": "so_svd",
    "type": "svd.train",
    "params": {
        "trainingData": """
            select
                COLUMN EXPR (AS columnName() ORDER BY rowCount() DESC, columnName() LIMIT 6000)
            from so_tags
        """,
        "columnOutputDataset": "so_svd_embedding",
        "numSingularValues": 100,
        "runOnCreation": True
    }
}))
# dataset so_kmeans_clusters = kmeans_cluster(so_svd_embedding)
print(mldb.post('/v1/procedures', {
    "id": "so_kmeans",
    "type": "kmeans.train",
    "params": {
        "trainingData": "select * from so_svd_embedding",
        "outputDataset": "so_kmeans_clusters",
        "numClusters": 20,
        "runOnCreation": True
    }
}))
# dataset so_tsne_embedding = tsne_embed(so_svd_embedding)
print(mldb.post('/v1/procedures', {
    "id": "so_tsne",
    "type": "tsne.train",
    "params": {
        "trainingData": "select * from so_svd_embedding",
        "rowOutputDataset": "so_tsne_embedding",
        "runOnCreation": True
    }
}))
In the program above we built three datasets:
so_tsne_embedding: contains x and y coordinates per tag
so_kmeans_clusters: contains a cluster per tag
so_counts: contains numQuestions per tag
Now we query out a pandas dataframe with the merge() of these three datasets, along with a 'grid' coordinate for each tag which we will use to automatically position labels in the visualization below.
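The 'grid' idea can be sketched in plain Python: snap each tag's coordinates to a coarse grid, then keep one tag per grid cell so that labels do not pile up. The quantize function below assumes MLDB's quantize rounds to the nearest multiple of its second argument; the rows are made-up values for illustration, not data from this notebook.

```python
import math

def quantize(value, step):
    """Round value to the nearest multiple of step (assumed to mirror MLDB's quantize)."""
    return math.floor(value / step + 0.5) * step

# Hypothetical (tag, x, y, numQuestions) rows, already sorted by numQuestions descending.
rows = [
    ("c#", 1.0, 2.0, 5000),
    ("java", 20.0, -15.0, 4000),
    (".net", 2.5, 3.0, 3000),        # lands in the same grid cell as "c#"
    ("android", 19.0, -13.0, 2500),  # lands in the same grid cell as "java"
]

labels = {}
for tag, x, y, num_questions in rows:
    cell = (quantize(x, 7), quantize(y, 7))
    # Keep only the first tag seen per cell, i.e. the most frequent one.
    labels.setdefault(cell, tag)

print(labels)
```

Because the rows are sorted by question count, the first tag seen in each cell is the most popular one, which is the tag worth labelling.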
In [4]:
df = mldb.query("""
    select c.* as *, m.* as *, quantize(m.x, 7) as grid_x, quantize(m.y, 7) as grid_y
    named c.rowName()
    from merge(so_tsne_embedding, so_kmeans_clusters) as m
    join so_counts as c on (c.rowName() = m.rowPathElement(0))
    where m.cluster is not null
    order by c.numQuestions desc
""")
df.head()
Out[4]:
In [5]:
import numpy as np
import bokeh.plotting as bp
from bokeh.models import HoverTool

colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])
In [6]:
#this line must be in its own cell
bp.output_notebook()
In [7]:
x = bp.figure(plot_width=900, plot_height=700, title="Map of StackOverflow Tags",
    tools=[HoverTool( tooltips=[ ("tag", "@tags") ] )], toolbar_location=None,
    x_axis_type=None, y_axis_type=None, min_border=1)
# One circle per tag: colored by cluster, sized by number of questions.
x.scatter(
    x = df.x.values,
    y = df.y.values,
    color=colormap[df.cluster.astype(int).values],
    alpha=0.6,
    radius=(df.numQuestions.values ** .35)/5,
    source=bp.ColumnDataSource({"tags": df.index.values})
)
# One label per grid cell: df is sorted by numQuestions, so first() keeps the top tag.
labels = df.reset_index().groupby(['grid_x', 'grid_y'], as_index=False).first()
labels = labels[labels["numQuestions"] > 200]
x.text(
    x = labels.x.values,
    y = labels.y.values,
    text = labels._rowName.values,
    text_align="center", text_baseline="middle",
    text_font_size="8pt", text_font_style="bold",
    text_color="#333333"
)
bp.show(x)
Out[7]:
Check out the other Tutorials and Demos.